Towards a Cloud-based Approach for Spam Url Deduplication for Big Datasets
نویسندگان
چکیده
Spam emails are often used to advertise phishing websites and lure users to visit such sites. URL blacklisting is a widely used technique for blocking malicious phishing websites. To prepare an effective blacklist, it is necessary to analyze possible threats and include the identified malicious sites in the blacklist. However, the number of URLs acquired from spam emails is quite large. Fetching and analyzing the content of this large number of websites are very expensive tasks given limited computing and storage resources. To solve the problem of massive computing and storage resource requirements, we need a highly distributed and scalable architecture, where we can provision additional resources to fetch and analyze on the fly. Moreover, there is a high degree of redundancy in the URLs extracted from spam emails, where more than one spam emails contain the same URL. Hence, preserving the contents of all the websites causes significant storage waste. Additionally, fetching content from a fixed IP address introduces the possibility of being reversed blacklisted by malicious websites. In this paper, we propose and develop CURLA – a Cloud-based spam URL Analyzer, built on top of Amazon Elastic Computer Cloud (EC2) and Amazon Simple Queue Service (SQS). CURLA allows deduplicating large number of spam-based URLs in parallel, which reduces the cost of establishing equally capable local infrastructure. Our system builds a database of unique spam-based URL and accumulates the content of these unique websites in a central repository. This database and website repository will be a great resource to identify phishing websites and other counterfeit websites. We show the effectiveness of our architecture using real-life, large-scale spam-based URL data.
منابع مشابه
A Review Paper on Hybrid Cloud Approach for Secure Authorized Data Deduplication
Cloud computing is best concept to handle big database as the world is moving towards digitization. The amount of digital data in the world is growing exponentially with time. Thus, employing storage optimization techniques is an essential requirement to large storage areas like cloud storage. Cloud computing is best concept to handle big datasets. Data de the best storage optimization techniqu...
متن کاملFeature-based Malicious URL and Attack Type Detection Using Multi-class Classification
Nowadays, malicious URLs are the common threat to the businesses, social networks, net-banking etc. Existing approaches have focused on binary detection i.e. either the URL is malicious or benign. Very few literature is found which focused on the detection of malicious URLs and their attack types. Hence, it becomes necessary to know the attack type and adopt an effective countermeasure. This pa...
متن کاملEfficient Skew Handling for Outer Joins in a Cloud Computing Environment
Outer joins are ubiquitous in many workloads and Big Data systems. The question of how to best execute outer joins in large parallel systems is particularly challenging, as real world datasets are characterized by data skew leading to performance issues. Although skew handling techniques have been extensively studied for inner joins, there is little published work solving the corresponding prob...
متن کاملA Conceptual Framework for Smart Hospital towards Industry 4.0
Background: The fourth industrial revolution consists of combining network devices with cloud computing methods and analyzing large data and artificial intelligence, which makes it possible to call such an infrastructure smart. In a Smart Hospital, all things and devices are designed to be connected and integrated, thus achieving better patient care, increasing efficiency and reducing time wast...
متن کاملA Classification Method for E-mail Spam Using a Hybrid Approach for Feature Selection Optimization
Spam is an unwanted email that is harmful to communications around the world. Spam leads to a growing problem in a personal email, so it would be essential to detect it. Machine learning is very useful to solve this problem as it shows good results in order to learn all the requisite patterns for classification due to its adaptive existence. Nonetheless, in spam detection, there are a large num...
متن کامل